Facilitating Treebank Annotation Using a Statistical Parser
Abstract
Corpora of phrase-structure-annotated text, or treebanks, are useful for supervised training of statistical models for natural language processing, as well as for corpus linguistics. Their primary drawback, however, is that they are very time-consuming to produce. To alleviate this problem, the standard approach is to make two passes over the text: first, parse the text automatically, then correct the parser output by hand. In this paper we explore three questions:
Related resources
Wide-Coverage Grammar Extraction from Thai Treebank
Parsing is an important step in natural language understanding, including phrase alignment for statistical machine translation. A parser's ability to analyse real text depends strongly on its grammar. A treebank can be one source for grammar extraction. However, treebank construction relies largely on human annotators' intuitions. Different intuitions from multiple annotators br...
Improving the complement/adjunct distinction in CCGbank
One of the challenges of adapting the Penn Treebank for a specific formalism is that the target annotation often requires information represented imperfectly or not at all in the original corpus. When this occurs, the information must either be guessed with heuristics, or annotated manually. Recently, a third option has become available, due to the release of resources that supplement the Penn ...
A Collaborative Annotation between Human Annotators and a Statistical Parser
We describe a new interactive annotation scheme between a human annotator who carries out simplified annotations on CFG trees, and a statistical parser that converts the human annotations automatically into a richly annotated HPSG treebank. In order to check the proposed scheme’s effectiveness, we performed automatic pseudo-annotations that emulate the system’s idealized behavior and measured t...
C-structures and F-structures for the British National Corpus
We describe how the British National Corpus (BNC), a one hundred million word balanced corpus of British English, was parsed into Lexical Functional Grammar (LFG) c-structures and f-structures, using a treebank-based parsing architecture. The parsing architecture uses a state-of-the-art statistical parser and reranker trained on the Penn Treebank to produce context-free phrase structure trees, ...
A Rule-Based Approach for Automatically Converting Dependency Parse Trees to Phrase-Structure Parse Trees for Persian
In this paper, an automatic method for converting a dependency parse tree into an equivalent phrase-structure tree is introduced for the Persian language. In the first step, a rule-based algorithm was designed. Then, Persian-specific dependency-to-phrase-structure conversion rules were merged into the algorithm. Subsequently, the Persian dependency treebank with about 30,000 sentences was used as an input ...